Speaker Localisation Using Audio-Visual Synchrony: An Empirical Study
نویسندگان
چکیده
This paper reviews definitions of audio-visual synchrony and examines their empirical behaviour on test sets up to 200 times larger than used by other authors. The results give new insights into the practical utility of existing synchrony definitions and justify application of audio-visual synchrony techniques to the problem of active speaker localisation in broadcast video. Performance is evaluated using a test set of twelve clips of alternating speakers from the multiple speaker CUAVE corpus. Accuracy of 76% is obtained for the task of identifying the active member of a speaker pair at different points in time, comparable to performance given by two purely video image-based schemes. Accuracy of 65% is obtained on the more challenging task of locating a point within a 100× 100 pixel square centered on the active speaker’s mouth without no prior face detection; the performance upper bound if perfect face detection were available is 69%. This result is significantly better than two purely video image-based schemes.
منابع مشابه
Robust audio-visual speech synchrony detection by generalized bimodal linear prediction
We study the problem of detecting audio-visual synchrony in video segments containing a speaker in frontal head pose. The problem holds a number of important applications, for example speech source localization, speech activity detection, speaker diarization, speech source separation, and biometric spoofing detection. In particular, we build on earlier work, extending our previously proposed ti...
متن کاملAudio-visual Multiple Active Speaker Localisation in Reverberant Environments
Localisation of multiple active speakers in natural environments with only two microphones is a challenging problem. Reverberation degrades the performance of speaker localisation based exclusively on directional cues. This paper presents an approach based on audio-visual fusion. The audio modality performs the multiple speaker localisation using the Skeleton method, energy weighting, and prece...
متن کاملModeling the Synchrony between Audio and Visual Modalities for Speaker Identification
This work aims to understand and model the inter-modal temporal relations between the audio and visual modalities of speech and validate whether the captured relations can improve the performance of audio-visual bimodal modeling for such applications as audio-visual speaker identification. We propose to extend our audio-visual correlative model (AVCM) with explicit durational modeling of the pa...
متن کاملDetecting audio-visual synchrony using deep neural networks
In this paper, we address the problem of automatically detecting whether the audio and visual speech modalities in frontal pose videos are synchronous or not. This is of interest in a wide range of applications, for example spoof detection in biometrics, lip-syncing, speaker detection and diarization in multi-subject videos, and video data quality assurance. In our adopted approach, we investig...
متن کاملAudio-visual synchronisation for speaker diarisation
The role of audio–visual speech synchrony for speaker diarisation is investigated on the multiparty meeting domain. We measured both mutual information and canonical correlation on different sets of audio and video features. As acoustic features we considered energy and MFCCs. As visual features we experimented both with motion intensity features, computed on the whole image, and Kanade Lucas T...
متن کامل